Ever since COVID-19 erupted into our world, research institutes and governments have publicly released many datasets so that research groups and independent individuals can analyze the spread of the coronavirus. We are facing an unprecedented public health crisis with the coronavirus (COVID-19) outbreak. We believe that data-driven decisions, and people working together for the greater good, are among the better ways to deal with this difficult time.
We want to know how the world's news media is covering the COVID-19 pandemic. Building on its massive television news narratives dataset, GDELT released a news dataset of the URLs, titles, publication dates and brief snippets of more than 1.1 million worldwide English-language online news articles mentioning the virus, to enable researchers and journalists to understand the global context of how the outbreak has been covered since November 2019. This dataset has been expanding daily and includes a number of related topics.
A single article on COVID-19 can cover various topics, such as health, the business implications of the disease or climate change, or it could simply be a front for spreading fake information. Given the huge number of news articles circulating on the web in the wake of COVID-19, it is very difficult to compile and compare them. To analyze what is being discussed during these difficult times, we would first have to collect all the news articles and then annotate them according to their implicit news sub-categories. This motivated us to create an approach that annotates news articles on the coronavirus without any manual intervention. With such a pipeline we aim not only to give researchers, media professionals and journalists access to similar articles, but also to spare them the time spent reading and understanding unrelated ones. We thereby aim to improve the quality of the groups of similar articles and of the topics that represent them.
We intend to address information overload, the huge flow of information that makes it harder for users to find related information on COVID-19 on the internet. We address it with an application that lets users find news matching their query or interest with little effort. We foresee some challenges, including determining the subtopic, extracting only the content of each webpage and presenting the data to the user. In real-world applications, multi-label classification (MLC), in which an object can carry more than one label, is widely useful. However, it is costly and tedious to label the dataset manually. An unsupervised learning approach should therefore be considered: cluster similar documents and then apply topic modelling to assign multiple labels to the clusters. We use an unsupervised learning technique (clustering) to group a collection of articles so that articles in the same group are more similar to each other than to those in other groups. Clustering can also help characterize the structure that is discovered.
We are trying to analyze this large set of news articles to make it easier for ordinary readers to filter through the many articles related to the virus and reach their own conclusions. Furthermore, we want to understand the semantic relations between different topics. Finally, we analyze keywords to uncover patterns in the news content.
Can we find articles with topics similar to those of a given article?
In order to answer this question, we need to answer the following research questions:
1. What is the most dominant topic in the article?
2. How do we determine which value of K is best suited and most interpretable for topic modeling on our dataset?
3. How does the topic model perform with different features, namely bag of words weighted by term frequency-inverse document frequency (TF-IDF) versus bag of words weighted by term frequency (TF) alone?
Data source
For our dataset we required news articles about the ongoing coronavirus pandemic. In our search, we came across the GDELT Project, which contains a compilation of URLs and brief snippets of worldwide English-language news coverage mentioning COVID-19. It contains data for the period November 1, 2019 through March 26, 2020. GDELT dataset: http://data.gdeltproject.org/blog/2020-coronavirus-narrative/live_onlinenews/MASTERFILELIST.TXT
Scraping Method
On digging deeper into the dataset, we realized that only snippets of the news articles were included. The snippets were chosen by performing a keyword search for the following terms: Cases, Covid19, Falsehoods, Masks, Panic, Prices, Quarantine, Shortages, SocialDistancing, Testing and Ventilators; and selecting the paragraph containing the first occurrence. In addition to containing one of these terms, either the sentence itself or the sentences immediately before and after it must also contain the term Coronavirus or Covid-19, which ensures that the news article is related to the coronavirus.
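The selection rule described above can be illustrated with a small base R sketch. It is only an approximation under stated assumptions: the helper name `find_snippet`, the naive sentence splitter and the regular expression for the virus terms are all hypothetical, and the function returns the matching sentence rather than the whole paragraph. It is not the code GDELT uses.

```r
# Hypothetical helper: return the first sentence that contains `keyword` and
# has a virus term in itself or in a directly neighbouring sentence.
find_snippet <- function(text, keyword) {
  # Naive sentence split: ., ! or ? followed by whitespace
  sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
  n <- length(sentences)
  has_keyword <- grepl(keyword, sentences, ignore.case = TRUE)
  has_virus <- grepl("coronavirus|covid-?19", sentences, ignore.case = TRUE)
  # TRUE where the sentence itself, the previous or the next one names the virus
  virus_nearby <- has_virus | c(FALSE, has_virus[-n]) | c(has_virus[-1], FALSE)
  hits <- which(has_keyword & virus_nearby)
  if (length(hits) == 0) return(NA_character_)
  sentences[hits[1]]
}

text <- paste("Shops report shortages of masks.",
              "The coronavirus outbreak is spreading.",
              "Prices are stable.")
find_snippet(text, "masks")
```

A text that mentions a keyword but never names the virus nearby yields `NA`, which corresponds to the article being skipped for that keyword.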
The GDELT dataset had news articles related to the coronavirus, but a snippet alone would not be sufficient to understand the underlying topic of an article. Hence, we decided to scrape the articles ourselves, using the URL of each article in the GDELT dataset.
The dataset contained several files, each holding the articles extracted on a particular day for a particular keyword. As processing all the articles in every file would be computationally infeasible, we agreed on creating a dataset of around 20,000 records. We recognize that the topics discussed during the initial period of the pandemic must have evolved in the months that followed. In order to capture the wide array of topics over the 5-month period, we first downloaded all the files. Then, from the files belonging to each keyword, we extracted a subset of records. This was repeated for all keywords. At the end of the extraction process we had around 20,000 news articles as our final dataset.
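The per-keyword extraction can be sketched in base R as follows. The text does not say how the records were picked within each keyword, so this sketch assumes a simple random draw of a fixed size per keyword; the data frame, its column names and the helper `sample_per_keyword` are hypothetical stand-ins.

```r
set.seed(42)

# Toy stand-in for the combined GDELT records (column names are assumptions)
records <- data.frame(
  keyword = rep(c("Masks", "Testing", "Quarantine"), times = c(50, 30, 20)),
  url = sprintf("https://example.org/article-%03d", 1:100),
  stringsAsFactors = FALSE
)

# Hypothetical helper: draw up to n_per_keyword rows at random for each keyword
sample_per_keyword <- function(df, n_per_keyword) {
  groups <- split(df, df$keyword)
  sampled <- lapply(groups, function(g) {
    g[sample.int(nrow(g), min(n_per_keyword, nrow(g))), , drop = FALSE]
  })
  do.call(rbind, sampled)
}

subset_df <- sample_per_keyword(records, 10)
table(subset_df$keyword)
```

Sampling within each keyword, rather than from the pooled records, keeps rarely used keywords represented in the final dataset.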
Cleanup
As the content we extracted came from websites, it contained numerous HTML tags and special characters. In the preprocessing stage, we first converted the data to lower case. We then cleaned the data by removing URLs (www, http), punctuation, special characters and stopwords, and by stripping surplus whitespace. Once this was complete, the preprocessed corpus was ready for analysis.
Storage
After preprocessing, the dataset was stored in a CSV file and uploaded to Google Drive. Dataset: https://drive.google.com/file/d/1qgQiIIi1yhXBj1jAOVz_2dhNT4C2i6bc/view?usp=sharing
…
Our dataset is in text format, so we pre-processed it before performing any exploratory analysis. This was required in order to clean it and remove unnecessary words or characters that would otherwise affect our analysis. Pre-processing is one of the most important steps in natural language processing, because well pre-processed data reduces the computation time required for further analysis, and the quality of the tokens and results tends to be higher than with poorly pre-processed data.
Steps taken for Pre-processing
Removed URLs from the content
Replaced punctuation, numbers and any other non-alphabetic characters
Converted Latin words to UTF-8
Converted the text to lower case
Removed stop words
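The steps above can be sketched in base R. This is a minimal illustration, not the project's actual code: the function name `preprocess` and the tiny stop word list are hypothetical, and the encoding conversion step is left out because its details are not given in the text.

```r
# Toy stop word list (an assumption; a real run would use a full list)
stop_words <- c("the", "a", "an", "of", "and", "is", "in", "to")

# Hypothetical helper applying the cleaning steps to a character vector
preprocess <- function(text, stop_words) {
  text <- gsub("(https?://|www\\.)\\S+", " ", text, perl = TRUE)  # remove URLs
  text <- gsub("[^A-Za-z]+", " ", text)  # punctuation, numbers, other symbols
  text <- tolower(text)                  # lower case
  tokens <- strsplit(trimws(text), "\\s+")
  # Drop stop words and glue the remaining tokens back together
  vapply(tokens,
         function(tok) paste(tok[!tok %in% stop_words], collapse = " "),
         character(1))
}

preprocess("Visit https://example.org NOW: 3 masks and the tests!", stop_words)
```

URLs are removed first so that their fragments do not survive as stray alphabetic tokens once punctuation is stripped.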
Word clouds represent the underlying words in a text, which in our case is the news articles dataset. We are interested in the most prominent words in the corpus. To find them, we generated word clouds for two variants of the bag-of-words model: one weighted by term frequency (TF) and one weighted by term frequency-inverse document frequency (TF-IDF).
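To make the difference between the two weightings concrete, here is a small base R sketch on three toy documents. It uses the plain variant tf × log(N / df); library implementations may use a different logarithm or normalize the term frequencies, so the numbers are illustrative only and the toy documents are invented for the example.

```r
# Three toy tokenized documents (invented for illustration)
docs <- list(
  d1 = c("virus", "virus", "mask"),
  d2 = c("virus", "test"),
  d3 = c("virus", "mask", "mask", "price")
)
vocab <- sort(unique(unlist(docs)))

# Term-frequency matrix: one row per document, one column per term
tf <- t(sapply(docs, function(d) table(factor(d, levels = vocab))))

# Inverse document frequency: log(number of documents / documents with the term)
doc_freq <- colSums(tf > 0)
idf <- log(nrow(tf) / doc_freq)

# TF-IDF: scale every column of tf by the idf of its term
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

The term "virus" occurs in every document, so its idf is log(1) = 0 and its TF-IDF weight vanishes, even though it has the highest raw term frequency. This is the effect that pushes corpus-wide words out of a TF-IDF word cloud.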
As we can see in the word cloud below, the news articles have been all about the coronavirus pandemic. Terms with higher frequencies appear in larger fonts. The words from the bag-of-words model stand out more in the word cloud because they are weighted by term frequency. The words in the TF-IDF model are weighted on the TF-IDF scale, so they look more uniform.
wordcloud2(df_bow_content,shape = "star",size = 0.4)
…
wordcloud2(df_tfidf_content,shape = "star",size = 0.15)
…
To understand the most prominent terms in the article titles, we created a word cloud for the titles of the articles in the corpus.
wordcloud2(df_bow_title,shape = "star",size = 0.15)
…
The length of every document is shown as a bar chart and as a scatter plot. Most documents have a length between 0 and 12,500.
doc_size<-ggplot(test_df, aes(x=ID, y=doc_length)) +geom_bar(stat="identity",aes(fill = ID))+theme_minimal()+ labs(y= "Size", x = "Document")
ggsave("Document_Size_Bar.png", plot = doc_size,height = 5, width = 7)
doc_size_scatter <- ggplot(test_df) + aes(x = X1, y = doc_length) +geom_point(size = 1L, colour = "#0c4c8a") +labs(x = "Document", y = "Document Length") +theme_minimal()
…
The number of unique words per document is shown as a density plot.
density <- test_df %>%
  ggplot(aes(x = unique_words)) +
  geom_density(fill = "#009E73", color = "#F0E442", alpha = 0.8) +
  labs(y = "Document", x = "Frequency")
# Zoom the x-axis and style its tick labels
density + coord_cartesian(xlim = c(0, 20000)) +
  theme(axis.text.x = element_text(face = "bold", color = "#993333",
                                   size = 12, angle = 45))
…
The words below are the 10 most frequently occurring words.
docs <- Corpus(VectorSource(test_df$pre_process_content))
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
params <- list(minDocFreq = 1,removeNumbers = TRUE,stopwords = TRUE,stemming = FALSE,weighting = weightTf,tokenize=UnigramTokenizer)
dtm <- DocumentTermMatrix(docs, control = params)
dtm <- removeSparseTerms(dtm, 0.99)
rowTotals <- apply(dtm, 1, sum) #Find the sum of words in each Document
dtm <- dtm[rowTotals> 0, ]
dtm_uni_freq <-dtm %>%
as.matrix %>%
colSums %>%
sort(decreasing=TRUE)
dtm_uni_freq_d <- data.frame(word = names(dtm_uni_freq), freq = dtm_uni_freq)
head(dtm_uni_freq_d, 10)
…
We now look at the number of articles published from January to April.
p <- test_df2 %>%
ggplot(aes(x=Date, y=Count,group = 1)) +
geom_area(fill="#69b3a2", alpha=0.5) +
geom_line(color="#69b3a2") +
ylab("Article Count") +
theme_ipsum() +
theme(axis.text.x = element_text(face = "bold", color = "azure4",
size = 8, angle = 90),panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p <- ggplotly(p)
Let us look at the overall sentiment of the published articles.
ggplot(test_df1, aes(x=ID, y=as.integer(sentiment))) +
geom_segment( aes(x=ID, xend=ID, y=0, yend=as.integer(sentiment), color=mycolor), size=1.3, alpha=0.9) +
theme_light() +
theme(
legend.position = "none",
panel.border = element_blank()
) +
labs(y= "Sentiment", x = "Document")
…
In order to get a better understanding of the most prevalent emotions in the articles, we visualized the strength of each emotion in the corpus.
quickplot(Emotions,data=all_emot, weight=count, geom="bar",fill=Emotions, ylab="count")+ggtitle("Emotion Analysis")
…
Topic modelling is an unsupervised method used to infer the abstract topics discussed across a collection of documents. Since the aim of our project is to classify documents, we used topic modelling as a means of labelling our data. Once we have the labelled data, the unseen test documents are classified based on their topic probabilities.
The first step in performing LDA is to determine the optimal number of topics. We do this with the perplexity measure. Since every topic is represented by a probability distribution, we need to measure how well these distributions predict a sample, and perplexity does exactly that: the lower the perplexity, the better the fit. The perplexity measure is applied to LDA objects with k ranging from 10 to 30 for both the bag-of-words model and the TF-IDF model. The LDA object with the lowest perplexity is deemed the best model, and its k is taken as the optimal number of topics.
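For reference, perplexity is commonly defined as the exponential of the negative average log-likelihood per token. The base R sketch below computes it from a document-term count matrix and the two probability matrices of a fitted topic model. The function name `perplexity` and the toy matrices are assumptions for illustration, and the text does not state which routine was used to compute the measure in the project.

```r
# Hypothetical helper: perplexity of a topic model on a dense count matrix.
#   dtm   documents x terms matrix of counts
#   theta documents x topics matrix, P(topic | document)
#   phi   topics x terms matrix, P(term | topic)
perplexity <- function(dtm, theta, phi) {
  p_wd <- theta %*% phi               # P(term | document)
  seen <- dtm > 0                     # only observed terms contribute
  log_lik <- sum(dtm[seen] * log(p_wd[seen]))
  exp(-log_lik / sum(dtm))
}

# Toy check: a single topic that is uniform over 4 terms
dtm <- matrix(c(2, 1, 0, 3,
                1, 1, 1, 1), nrow = 2, byrow = TRUE)
theta <- matrix(1, nrow = 2, ncol = 1)
phi <- matrix(0.25, nrow = 1, ncol = 4)
perplexity(dtm, theta, phi)
```

A model that spreads probability uniformly over V terms has perplexity V, which gives an intuitive reading: a lower value means the model is, on average, choosing among fewer equally likely terms. Comparing this value across candidate values of k and keeping the smallest is the selection rule described above.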
…
In our case, the best model turned out to be the bag-of-words model with K = 25 topics.
The LDA model was built on term frequency with k = 25 topics. Topic probabilities were then predicted with both the Gibbs sampling and the dot-product methods.
Model for term frequency with Gibbs sampling
LDA_model_bow <- FitLdaModel(dtm = sparse_matrix_dtm_bow, k = as.integer(i),
iterations = 200, burnin = 175)
p1_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow),],
method = "gibbs",iterations = 200, burnin = 175)
p2_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow),], method = "dot")
Model for term frequency with the dot-product method
LDA_model_bow <- FitLdaModel(dtm = sparse_matrix_dtm_bow, k = as.integer(i),
iterations = 200, burnin = 175)
p1_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow),],
method = "dot",iterations = 200, burnin = 175)
p2_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow),], method = "dot")
Each document now has a probability for each of the 25 topics. These probabilities act as the labelled data for further prediction.
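Given such a matrix of topic probabilities, the most dominant topic of each article (the first research question) is simply the column with the highest value in its row. The sketch below uses a toy 3 × 3 matrix as a stand-in; in the project this would be the document-topic matrix of the fitted model (assumed here to be the `theta` element that textmineR models expose), and the topic names are placeholders.

```r
# Toy document-topic matrix; every row sums to 1 (values invented)
theta <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.3, 0.6,
                  0.4, 0.5, 0.1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("doc1", "doc2", "doc3"),
                                c("t_1", "t_2", "t_3")))

# For every document, pick the topic with the highest probability
dominant <- data.frame(
  document = rownames(theta),
  topic = colnames(theta)[max.col(theta, ties.method = "first")],
  probability = apply(theta, 1, max),
  stringsAsFactors = FALSE
)
dominant
```

Keeping the winning probability alongside the label makes it easy to flag documents whose dominant topic is only weakly dominant.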
Once all the documents in the training set are labelled, the next step is to predict the topic probabilities for the unseen test set. The predict method of the LDA model is used for this.
Predicting the topics using Term frequency with Gibbs sampling model
p1_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow),], method = "gibbs",iterations = 200, burnin = 175)
The probability distribution of the topics in the train & test set for the Term frequency with Gibbs sampling model can be seen in the plot below.
…
Predicting the topics using the term frequency model with the dot-product method
p1_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow),], method = "dot",iterations = 200, burnin = 175)
The probability distribution of the topics in the train & test set for the term frequency model with the dot-product method can be seen in the plot below.
Perplexity Score
The next step is to evaluate the model, for which we used the log likelihood. The higher the value, the better the model. The plot below shows the log likelihood for the two models.
From the plot it is evident that the bag-of-words model (term frequency) performs better, and hence we use this model from here on.
Once the prediction is done, we have topic probabilities for all the documents. To explore similarities between topics, we cluster the documents based on their topic probabilities.
To perform clustering, we need to decide on the optimal number of clusters. This was determined using the elbow curve, which puts the optimal number of clusters at 8.
#Reducing the dimensions via tsne
tsne <- Rtsne(doc_topics_gamma[,-1], perplexity = 30, pca = FALSE, check_duplicates = FALSE)
X <- data.frame(tsne$Y)
#Find best no. of clusters for 25 topics
wss <- (nrow(X)-1)*sum(apply(X,2,var))
for (i in 1:100) wss[i] <- sum(kmeans(X,iter.max = 50L,centers=i)$withinss)
plot(1:100, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
Another approach we used to find the optimal number of clusters was the silhouette coefficient. For each point, the silhouette coefficient compares the mean distance to the other points in its own cluster (intra-cluster distance) with the mean distance to the points in the nearest other cluster (inter-cluster distance). We evaluated this value for 8 and 15 clusters, and the results can be seen in the plots below.
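The definition can be written out in a few lines of base R. For a point i with mean intra-cluster distance a and mean distance b to the nearest other cluster, the silhouette value is (b - a) / max(a, b), and the coefficient of a clustering is the mean over all points. The function name `silhouette_mean` and the toy data are assumptions for illustration; the sketch expects at least two clusters and is not necessarily the routine used in the project.

```r
# Hypothetical helper: mean silhouette coefficient for a clustering.
#   X       numeric matrix or data frame of points (one row per point)
#   cluster vector of cluster labels, one per row of X (at least two clusters)
silhouette_mean <- function(X, cluster) {
  d <- as.matrix(dist(X))             # pairwise Euclidean distances
  labels <- unique(cluster)
  s <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    own <- cluster == cluster[i]
    own[i] <- FALSE                   # exclude the point itself
    if (!any(own)) next               # singleton cluster: s stays 0
    a <- mean(d[i, own])              # mean distance within own cluster
    b <- min(sapply(setdiff(labels, cluster[i]),
                    function(cl) mean(d[i, cluster == cl])))  # nearest other
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

# Two tight, well-separated groups of two points each
X <- matrix(c(0, 0,
              0, 1,
              10, 0,
              10, 1), ncol = 2, byrow = TRUE)
silhouette_mean(X, c(1, 1, 2, 2))   # close to 1: good separation
silhouette_mean(X, c(1, 2, 1, 2))   # negative: points sit in the wrong cluster
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate points that are closer to another cluster than to their own.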
The silhouette coefficient for our clustering was 0.33. Given that all the documents in our dataset discuss the coronavirus, it is no surprise that the value is low: the clusters lie close to one another, and so do the documents within them.
Finally, the articles were grouped into 8 clusters.
k3 <- kmeans(X,centers = 8, nstart = 5,iter.max = 100000L)
fviz_cluster(k3,X)
Convex Hull Plot for 8 clusters
The entire document corpus is visualized in the RBokeh graph. Hovering over a document displays its title, URL and most dominant topic. The graph shows that documents belonging to the same topic lie relatively close to each other, although some exceptions exist near the boundaries of each topic.